This is a quick walkthrough demonstrating how to generate SWNE plots alongside the Pagoda2 pipeline using a 3k PBMC dataset as an example.
To save time we will be using the pre-computed Pagoda2 object pbmc3k_pagoda2.Robj, which can be downloaded here.
First let’s load the required libraries
library(pagoda2)
library(swne)
library(Matrix)
Next let’s load the Pagoda2 object
p2 <- readRDS("/media/Home_Raid1/yan/swne/Data/pbmc3k_pagoda2.Robj")
Most scRNA-seq pipelines only use a subset of highly overdispersed genes for analysis. We’ll pull out those variable genes here, as well as the cluster labels.
## Pull out variable genes
n.od.genes <- 1.5e3
var.info <- p2$misc$varinfo; var.info <- var.info[order(var.info$lp, decreasing = F),];
od.genes <- rownames(var.info[1:n.od.genes,])
length(od.genes)
## [1] 1500
## Pull out clusters
clusters <- p2$clusters$PCA$multilevel
levels(clusters)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
The easiest way to generate an SWNE embedding is to use the wrapper function RunSWNE
## Run SWNE
genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
"FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
swne.embedding <- RunSWNE(p2, k = 14, var.genes = od.genes, genes.embed = genes.embed)
## calculating variance fit ... using gam [1] "1500 variable genes to use"
## running PCA using 1500 OD genes ..
## Loading required package: irlba
## .. done
## Initial stress : 0.22005
## stress after 10 iters: 0.08860, magic = 0.092
## stress after 20 iters: 0.07003, magic = 0.500
## stress after 30 iters: 0.06859, magic = 0.500
## stress after 40 iters: 0.06850, magic = 0.500
## Plot SWNE
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = clusters,
do.label = T, label.size = 3.5, pt.size = 1.5, show.legend = F,
seed = 42)
Now we’ll go through the SWNE embedding process step by step
First, let’s pull out the counts, and log-scale and adjust the variance of each gene
norm.counts <- ExtractNormCounts(p2, obj.type = "pagoda2", rescale = T, rescale.method = "log")
## calculating variance fit ... using gam
dim(norm.counts)
## [1] 8545 2696
We use the FindNumFactors function to identify the optimal number of factors to use. This function can be slow for large datasets, since it iterates over different values of k, so a simple “hack” is to just set k equal to the number of significant principal components.
k.range <- seq(2,16,2) ## Range of factors to iterate over
k.err <- FindNumFactors(norm.counts[od.genes,], k.range = k.range, n.cores = 8, do.plot = T)
We then run the NMF decomposition. We can initialize the NMF using either Independent Component Analysis (ica), Nonnegative SVD (nnsvd), or a completely random initialization. ICA is the best option for most datasets. The output of RunNMF is a list of the gene loadings (W) and NMF embedding (H).
k <- 14
nmf.res <- RunNMF(norm.counts[od.genes,], k = k, init = "ica", n.cores = 8)
nmf.scores <- nmf.res$H
Compute the SNN matrix from the PCA embedding
p2$calculatePcaReduction(nPcs = 20, odgenes = od.genes)
pc.scores <- t(p2$reductions$PCA)
snn <- CalcSNN(pc.scores, k = 20, prune.SNN = 1/20)
Runs the SWNE embedding. The three key parameters are alpha.exp, snn.exp, and n_pull, which control how the factors and neighboring cells affect the cell coordinates.
alpha.exp <- 1.25 # Increase this > 1.0 to move the cells closer to the factors. Values > 2 start to distort the data.
snn.exp <- 1.0 # Lower this < 1.0 to move similar cells closer to each other
n_pull <- 3 # The number of factors pulling on each cell. Must be at least 3.
swne.embedding <- EmbedSWNE(nmf.scores, snn, alpha.exp = alpha.exp, snn.exp = snn.exp,
n_pull = n_pull, proj.method = "sammon", dist.use = "cosine")
## Initial stress : 0.21581
## stress after 10 iters: 0.07796, magic = 0.461
## stress after 20 iters: 0.07031, magic = 0.500
## stress after 30 iters: 0.06909, magic = 0.500
## stress after 40 iters: 0.06899, magic = 0.500
For now, let’s hide the factors by setting their names to the empty string "". We’ll interpret them later
swne.embedding$H.coords$name <- ""
To help with interpreting these cell clusters, let’s pick some key PBMC genes to embed.
genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
"FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
Since we only ran NMF on the overdispersed genes, we need to project the rest of the genes onto the NMF projection to get gene loadings for all genes.
nmf.res$W <- ProjectFeatures(norm.counts, nmf.scores, n.cores = 8)
Now we can embed the key PBMC genes onto the visualization and remake the plot
swne.embedding <- EmbedFeatures(swne.embedding, nmf.res$W, genes.embed, n_pull = 3)
Let’s make the SWNE plot with the key genes embedded. The closer a cell or a cluster is to a gene, the higher the expression level. We set a seed for reproducible cluster colors, so that every plot will use the same colors to label the clusters.
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = clusters, do.label = T,
label.size = 3.5, pt.size = 1.5, show.legend = F, seed = 42)
We can validate this by overlaying the expression of one of these key genes onto the plot.
gene.use <- "CD8A"
gene.expr <- norm.counts[gene.use,]
FeaturePlotSWNE(swne.embedding, gene.expr, gene.use, alpha.plot = 0.4, label.size = 3.5, pt.size = 1.25)
We can also make a t-SNE plot for comparison.
tsne.scores <- p2$embeddings$PCA$tSNE
PlotDims(tsne.scores, sample.groups = clusters, pt.size = 0.75, label.size = 3.5, alpha = 0.3,
show.legend = F, seed = 42, show.axes = F)
We can also interpret the factors by using the gene loadings matrix. Here, we extract the top 3 genes for each factor by gene loading. Since NMF tends to create a parts-based representation of the data, the factors often correspond to key biological processes or gene modules that explain the data.
gene.loadings <- nmf.res$W
top.factor.genes.df <- SummarizeAssocFeatures(gene.loadings, features.return = 3)
head(top.factor.genes.df)
## assoc_score feature factor
## 1 0.9071141 FTH1 factor_1
## 2 0.6815767 FTL factor_1
## 3 0.6094287 LTB factor_1
## 4 0.6622717 RPL34 factor_2
## 5 0.5853592 RPL39 factor_2
## 6 0.5732522 ACTG1 factor_2
And finally, we can make a heatmap to visualize the top factors for each gene
gene.loadings.heat <- gene.loadings[unique(top.factor.genes.df$feature),]
ggHeat(gene.loadings.heat, clustering = "col")
Extract cluster colors for compatibility with other plotting methods (i.e. Monocle)
color.mapping <- ExtractSWNEColors(swne.embedding, sample.groups = clusters, seed = 42)
color.mapping
## 7 1 8 2 4 6 3
## "#CD9600" "#FF61CC" "#F8766D" "#7CAE00" "#C77CFF" "#00A9FF" "#00BFC4"
## 5
## "#00BE67"